#3¶

Kaggle competition: [link]

Entry by Robin P.M. Kras

⭐ 1. Introduction & Overview¶

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

🔹 2. Import Libraries & Set Up¶

In [255]:
# General
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
import xgboost as xg
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error, r2_score, root_mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression

# Feature Importance & Explainability
import shap

# Settings
import warnings
warnings.filterwarnings("ignore")

# Set random seed for reproducibility
SEED = 42
np.random.seed(SEED)

print("Libraries loaded. Ready to go!")
Libraries loaded. Ready to go!

🔹 3. Load & Explore Data¶

In [256]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
In [257]:
train.head()
Out[257]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

In [258]:
train.shape
Out[258]:
(1460, 81)
In [259]:
train.isnull().sum()
Out[259]:
Id                 0
MSSubClass         0
MSZoning           0
LotFrontage      259
LotArea            0
                ... 
MoSold             0
YrSold             0
SaleType           0
SaleCondition      0
SalePrice          0
Length: 81, dtype: int64
In [260]:
# Quick summary of dataset
train.describe()
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallCond    1460 non-null   int64  
 19  YearBuilt      1460 non-null   int64  
 20  YearRemodAdd   1460 non-null   int64  
 21  RoofStyle      1460 non-null   object 
 22  RoofMatl       1460 non-null   object 
 23  Exterior1st    1460 non-null   object 
 24  Exterior2nd    1460 non-null   object 
 25  MasVnrType     588 non-null    object 
 26  MasVnrArea     1452 non-null   float64
 27  ExterQual      1460 non-null   object 
 28  ExterCond      1460 non-null   object 
 29  Foundation     1460 non-null   object 
 30  BsmtQual       1423 non-null   object 
 31  BsmtCond       1423 non-null   object 
 32  BsmtExposure   1422 non-null   object 
 33  BsmtFinType1   1423 non-null   object 
 34  BsmtFinSF1     1460 non-null   int64  
 35  BsmtFinType2   1422 non-null   object 
 36  BsmtFinSF2     1460 non-null   int64  
 37  BsmtUnfSF      1460 non-null   int64  
 38  TotalBsmtSF    1460 non-null   int64  
 39  Heating        1460 non-null   object 
 40  HeatingQC      1460 non-null   object 
 41  CentralAir     1460 non-null   object 
 42  Electrical     1459 non-null   object 
 43  1stFlrSF       1460 non-null   int64  
 44  2ndFlrSF       1460 non-null   int64  
 45  LowQualFinSF   1460 non-null   int64  
 46  GrLivArea      1460 non-null   int64  
 47  BsmtFullBath   1460 non-null   int64  
 48  BsmtHalfBath   1460 non-null   int64  
 49  FullBath       1460 non-null   int64  
 50  HalfBath       1460 non-null   int64  
 51  BedroomAbvGr   1460 non-null   int64  
 52  KitchenAbvGr   1460 non-null   int64  
 53  KitchenQual    1460 non-null   object 
 54  TotRmsAbvGrd   1460 non-null   int64  
 55  Functional     1460 non-null   object 
 56  Fireplaces     1460 non-null   int64  
 57  FireplaceQu    770 non-null    object 
 58  GarageType     1379 non-null   object 
 59  GarageYrBlt    1379 non-null   float64
 60  GarageFinish   1379 non-null   object 
 61  GarageCars     1460 non-null   int64  
 62  GarageArea     1460 non-null   int64  
 63  GarageQual     1379 non-null   object 
 64  GarageCond     1379 non-null   object 
 65  PavedDrive     1460 non-null   object 
 66  WoodDeckSF     1460 non-null   int64  
 67  OpenPorchSF    1460 non-null   int64  
 68  EnclosedPorch  1460 non-null   int64  
 69  3SsnPorch      1460 non-null   int64  
 70  ScreenPorch    1460 non-null   int64  
 71  PoolArea       1460 non-null   int64  
 72  PoolQC         7 non-null      object 
 73  Fence          281 non-null    object 
 74  MiscFeature    54 non-null     object 
 75  MiscVal        1460 non-null   int64  
 76  MoSold         1460 non-null   int64  
 77  YrSold         1460 non-null   int64  
 78  SaleType       1460 non-null   object 
 79  SaleCondition  1460 non-null   object 
 80  SalePrice      1460 non-null   int64  
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB

🔹 4. Data Visualization & EDA¶

In [261]:
float_cols = [col for col in train.columns if train[col].dtype == "float64"]

cols_per_row = 3
num_plots = len(float_cols)
rows = (num_plots // cols_per_row) + (num_plots % cols_per_row > 0) 

fig, axes = plt.subplots(rows, cols_per_row, figsize=(15, 5 * rows)) 
axes = axes.flatten()  

for idx, col in enumerate(float_cols):
    sns.histplot(train[col], bins=50, kde=True, ax=axes[idx])
    axes[idx].set_title(f"Distribution of {col}")

for i in range(idx + 1, len(axes)):
    fig.delaxes(axes[i])

plt.tight_layout()
plt.show()
No description has been provided for this image
In [262]:
categorical_features = train.select_dtypes(include=['object']).columns

num_features = len(categorical_features)
cols = 3 
rows = (num_features // cols) + (num_features % cols > 0) 

# Create subplots
fig, axes = plt.subplots(rows, cols, figsize=(15, rows * 5)) 
axes = axes.flatten()  

for i, feature in enumerate(categorical_features):
    train[feature].value_counts().plot.pie(
        autopct='%1.1f%%', ax=axes[i], startangle=90, cmap="viridis"
    )
    axes[i].set_title(feature)
    axes[i].set_ylabel("") 

# Hide any unused subplots
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.show()
No description has been provided for this image
In [263]:
heatmap_train = pd.DataFrame()

for col in train.columns:
    if train[col].dtype == "float64" or train[col].dtype == "int64":
        heatmap_train[col] = train[col]

plt.figure(figsize=(30,12))
sns.heatmap(heatmap_train.corr(), annot=True, cmap="coolwarm")
plt.title("Feature Correlation Matrix")
plt.show()
No description has been provided for this image
In [264]:
heatmap_train = train.select_dtypes(include=["float64", "int64"])

corr_matrix = heatmap_train.corr()

threshold = 0.75

high_corr_pairs = (
    corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)) 
    .stack()  
    .reset_index()
)

high_corr_pairs.columns = ["Feature 1", "Feature 2", "Correlation"]
high_corr_pairs = high_corr_pairs[high_corr_pairs["Correlation"].abs() > threshold]  

plt.figure(figsize=(30, 12))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm")
plt.title("Feature Correlation Matrix")
plt.show()

print("Highly correlated feature pairs (above threshold):")
print(high_corr_pairs)
No description has been provided for this image
Highly correlated feature pairs (above threshold):
       Feature 1     Feature 2  Correlation
174  OverallQual     SalePrice     0.790982
225    YearBuilt   GarageYrBlt     0.825667
378  TotalBsmtSF      1stFlrSF     0.819530
478    GrLivArea  TotRmsAbvGrd     0.825489
637   GarageCars    GarageArea     0.882475
In [265]:
#interesting_features = ["OverallQual", "YearBuilt", "GarageYrBlt", "TotalBsmtSF", "1stFlrSF", "GrLivArea", "TotRmsAbvGrd", "GarageCars", "GarageArea"]

l1 = high_corr_pairs['Feature 1'].tolist()
l2 = high_corr_pairs['Feature 2'].tolist()
interesting_features = list(set(l1+l2))

interesting_features.remove('SalePrice')

print(interesting_features)
['GarageYrBlt', 'TotalBsmtSF', 'OverallQual', '1stFlrSF', 'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'GrLivArea', 'YearBuilt']

🔹 5. Feature Engineering¶

In [266]:
train.columns = train.columns.str.strip()
test.columns = test.columns.str.strip()
In [267]:
print(f"Train set, null count: \n{train.isnull().sum()}")
print("\n")
print(f"Test set, null count: \n{test.isnull().sum()}")
Train set, null count: 
Id                 0
MSSubClass         0
MSZoning           0
LotFrontage      259
LotArea            0
                ... 
MoSold             0
YrSold             0
SaleType           0
SaleCondition      0
SalePrice          0
Length: 81, dtype: int64


Test set, null count: 
Id                 0
MSSubClass         0
MSZoning           4
LotFrontage      227
LotArea            0
                ... 
MiscVal            0
MoSold             0
YrSold             0
SaleType           1
SaleCondition      0
Length: 80, dtype: int64
In [268]:
train["LotFrontage"] = train.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))
test["LotFrontage"] = test.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))

for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    train[col] = train[col].fillna('None')
    test[col] = test[col].fillna('None')

for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    train[col] = train[col].fillna(0)
    test[col] = test[col].fillna(0)

train['TotalSF'] = train['TotalBsmtSF'] + train['1stFlrSF'] + train['2ndFlrSF']
test['TotalSF'] = test['TotalBsmtSF'] + test['1stFlrSF'] + test['2ndFlrSF']
In [269]:
for col in train.columns:
    if train[col].dtype == "object":
        train[col] = train[col].fillna("None")
    elif train[col].dtype in ["float64", "int64"]:
        train[col] = train[col].fillna(train[col].mean())

for col in test.columns:
    if test[col].dtype == "object":
        test[col] = test[col].fillna("None")
    elif test[col].dtype in ["float64", "int64"]:
        test[col] = test[col].fillna(test[col].mean())    
In [270]:
for col in train.columns:
    if train[col].isnull().sum() > 0:
        print(col)

for col in test.columns:
    if test[col].isnull().sum() > 0:
        print(col)

No more empty items left. Great!

In [271]:
import itertools

def create_combination_features(df, features):
    combinations = itertools.combinations(features, 2)

    for comb in combinations:
        feature_name = "_".join(comb)
        df[feature_name] = df[list(comb)].mean(axis=1)
    
    return df

train = create_combination_features(train, interesting_features)
test = create_combination_features(test, interesting_features)
In [272]:
train.head()
Out[272]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... TotRmsAbvGrd_GarageCars TotRmsAbvGrd_GarageArea TotRmsAbvGrd_GrLivArea TotRmsAbvGrd_YearBuilt GarageCars_GarageArea GarageCars_GrLivArea GarageCars_YearBuilt GarageArea_GrLivArea GarageArea_YearBuilt GrLivArea_YearBuilt
0 1 60 RL 65.0 8450 Pave None Reg Lvl AllPub ... 5.0 278.0 859.0 1005.5 275.0 856.0 1002.5 1129.0 1275.5 1856.5
1 2 20 RL 80.0 9600 Pave None Reg Lvl AllPub ... 4.0 233.0 634.0 991.0 231.0 632.0 989.0 861.0 1218.0 1619.0
2 3 60 RL 68.0 11250 Pave None IR1 Lvl AllPub ... 4.0 307.0 896.0 1003.5 305.0 894.0 1001.5 1197.0 1304.5 1893.5
3 4 70 RL 60.0 9550 Pave None IR1 Lvl AllPub ... 5.0 324.5 862.0 961.0 322.5 860.0 959.0 1179.5 1278.5 1816.0
4 5 60 RL 84.0 14260 Pave None IR1 Lvl AllPub ... 6.0 422.5 1103.5 1004.5 419.5 1100.5 1001.5 1517.0 1418.0 2099.0

5 rows × 118 columns

In [273]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

for col in train.columns:
    if train[col].dtype == "object":
        train[col] = le.fit_transform(train[col])

for col in test.columns:
    if test[col].dtype == "object":
        test[col] = le.fit_transform(test[col])

🔹 6. Model Selection¶

In [274]:
X = train.drop(columns=["Id", "SalePrice"])
X_test = test.drop(columns=["Id"])

y = train['SalePrice']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=SEED)
In [275]:
param_grid = {
    #'n_estimators': [100, 200, 500],  
    #'learning_rate': [0.01, 0.05, 0.1],  
    #'max_depth': [3, 5, 7, 9],  
    #'subsample': [0.8, 0.9, 1.0], 
    #'colsample_bytree': [0.8, 0.9, 1.0],
    'alpha': [0, 0.01, 0.1, 1],
    'lambda': [0, 0.1, 0.5, 1],
    'gamma': [0, 0.1, 0.2, 1],
    'early_stopping_rounds': [5, 10, 20, 30]
}

grid_search = GridSearchCV(xg.XGBRegressor(tree_method="gpu_hist", random_state=SEED), param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train, 
            eval_set=[(X_train, y_train), (X_val, y_val)])

print("Best Parameters:", grid_search.best_params_)

best_params = grid_search.best_params_
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
Cell In[275], line 14
      1 param_grid = {
      2     #'n_estimators': [100, 200, 500],  
      3     #'learning_rate': [0.01, 0.05, 0.1],  
   (...)
     10     'early_stopping_rounds': [5, 10, 20, 30]
     11 }
     13 grid_search = GridSearchCV(xg.XGBRegressor(tree_method="gpu_hist", random_state=SEED), param_grid, cv=5, n_jobs=-1)
---> 14 grid_search.fit(X_train, y_train, 
     15             eval_set=[(X_train, y_train), (X_val, y_val)])
     17 print("Best Parameters:", grid_search.best_params_)
     19 best_params = grid_search.best_params_

File c:\Users\robkr\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\base.py:1473, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1466     estimator._validate_params()
   1468 with config_context(
   1469     skip_parameter_validation=(
   1470         prefer_skip_nested_validation or global_skip_validation
   1471     )
   1472 ):
-> 1473     return fit_method(estimator, *args, **kwargs)

File c:\Users\robkr\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\model_selection\_search.py:1018, in BaseSearchCV.fit(self, X, y, **params)
   1012     results = self._format_results(
   1013         all_candidate_params, n_splits, all_out, all_more_results
   1014     )
   1016     return results
-> 1018 self._run_search(evaluate_candidates)
   1020 # multimetric is determined here because in the case of a callable
   1021 # self.scoring the return type is only known after calling
   1022 first_test_score = all_out[0]["test_scores"]

File c:\Users\robkr\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\model_selection\_search.py:1572, in GridSearchCV._run_search(self, evaluate_candidates)
   1570 def _run_search(self, evaluate_candidates):
   1571     """Search all candidates in param_grid"""
-> 1572     evaluate_candidates(ParameterGrid(self.param_grid))

File c:\Users\robkr\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\model_selection\_search.py:964, in BaseSearchCV.fit.<locals>.evaluate_candidates(candidate_params, cv, more_results)
    956 if self.verbose > 0:
    957     print(
    958         "Fitting {0} folds for each of {1} candidates,"
    959         " totalling {2} fits".format(
    960             n_splits, n_candidates, n_candidates * n_splits
    961         )
    962     )
--> 964 out = parallel(
    965     delayed(_fit_and_score)(
    966         clone(base_estimator),
    967         X,
    968         y,
    969         train=train,
    970         test=test,
    971         parameters=parameters,
    972         split_progress=(split_idx, n_splits),
    973         candidate_progress=(cand_idx, n_candidates),
    974         **fit_and_score_kwargs,
    975     )
    976     for (cand_idx, parameters), (split_idx, (train, test)) in product(
    977         enumerate(candidate_params),
    978         enumerate(cv.split(X, y, **routed_params.splitter.split)),
    979     )
    980 )
    982 if len(out) < 1:
    983     raise ValueError(
    984         "No fits were performed. "
    985         "Was the CV iterator empty? "
    986         "Were there no candidates?"
    987     )

File c:\Users\robkr\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\utils\parallel.py:74, in Parallel.__call__(self, iterable)
     69 config = get_config()
     70 iterable_with_config = (
     71     (_with_config(delayed_func, config), args, kwargs)
     72     for delayed_func, args, kwargs in iterable
     73 )
---> 74 return super().__call__(iterable_with_config)

File c:\Users\robkr\AppData\Local\Programs\Python\Python311\Lib\site-packages\joblib\parallel.py:2007, in Parallel.__call__(self, iterable)
   2001 # The first item from the output is blank, but it makes the interpreter
   2002 # progress until it enters the Try/Except block of the generator and
   2003 # reaches the first `yield` statement. This starts the asynchronous
   2004 # dispatch of the tasks to the workers.
   2005 next(output)
-> 2007 return output if self.return_generator else list(output)

File c:\Users\robkr\AppData\Local\Programs\Python\Python311\Lib\site-packages\joblib\parallel.py:1650, in Parallel._get_outputs(self, iterator, pre_dispatch)
   1647     yield
   1649     with self._backend.retrieval_context():
-> 1650         yield from self._retrieve()
   1652 except GeneratorExit:
   1653     # The generator has been garbage collected before being fully
   1654     # consumed. This aborts the remaining tasks if possible and warn
   1655     # the user if necessary.
   1656     self._exception = True

File c:\Users\robkr\AppData\Local\Programs\Python\Python311\Lib\site-packages\joblib\parallel.py:1762, in Parallel._retrieve(self)
   1757 # If the next job is not ready for retrieval yet, we just wait for
   1758 # async callbacks to progress.
   1759 if ((len(self._jobs) == 0) or
   1760     (self._jobs[0].get_status(
   1761         timeout=self.timeout) == TASK_PENDING)):
-> 1762     time.sleep(0.01)
   1763     continue
   1765 # We need to be careful: the job list can be filling up as
   1766 # we empty it and Python list are not thread-safe by
   1767 # default hence the use of the lock

KeyboardInterrupt: 
In [282]:
model = xg.XGBRegressor(
    n_estimators=200, 
    learning_rate=0.1, 
    max_depth=6,
    early_stopping_rounds=30,
    random_state=SEED)

model.fit(X_train, y_train, 
            eval_set=[(X_train, y_train), (X_val, y_val)])

results = model.evals_result()

plt.figure(figsize=(10,7))
plt.plot(results["validation_0"]["rmse"], label="Training loss")
plt.plot(results["validation_1"]["rmse"], label="Validation loss")
plt.axvline(21, color="gray", label="Optimal tree number")
plt.xlabel("Number of trees")
plt.ylabel("Loss")
plt.legend()

predictions = model.predict(X_val)

mse = mean_squared_error(y_val, predictions)
mae = mean_absolute_error(y_val, predictions)
r2 = r2_score(y_val, predictions)
rms = root_mean_squared_error(y_val, predictions)

print(f"Mean Squared Error: {mse}")
print(f"Mean Absolute Error: {mae}")
print(f"R² Score: {r2}")
print(f"RMSE Score: {rms}")
[0]	validation_0-rmse:70970.99675	validation_1-rmse:81185.69429
[1]	validation_0-rmse:65389.63001	validation_1-rmse:75536.54372
[2]	validation_0-rmse:60325.50352	validation_1-rmse:70391.48674
[3]	validation_0-rmse:55779.77828	validation_1-rmse:65897.70937
[4]	validation_0-rmse:51688.05832	validation_1-rmse:61486.43675
[5]	validation_0-rmse:48065.76245	validation_1-rmse:57493.91657
[6]	validation_0-rmse:44762.08765	validation_1-rmse:54121.95396
[7]	validation_0-rmse:41787.54140	validation_1-rmse:51068.12859
[8]	validation_0-rmse:39174.55442	validation_1-rmse:48461.50333
[9]	validation_0-rmse:36846.50650	validation_1-rmse:46160.89333
[10]	validation_0-rmse:34761.88632	validation_1-rmse:44186.04997
[11]	validation_0-rmse:32901.03478	validation_1-rmse:42473.86017
[12]	validation_0-rmse:31105.18465	validation_1-rmse:40959.48824
[13]	validation_0-rmse:29542.99758	validation_1-rmse:39595.32568
[14]	validation_0-rmse:28124.51338	validation_1-rmse:38338.29472
[15]	validation_0-rmse:26930.49823	validation_1-rmse:37394.57382
[16]	validation_0-rmse:25759.82301	validation_1-rmse:36433.66961
[17]	validation_0-rmse:24692.30199	validation_1-rmse:35612.50055
[18]	validation_0-rmse:23724.48250	validation_1-rmse:34889.01702
[19]	validation_0-rmse:22904.96392	validation_1-rmse:34120.27285
[20]	validation_0-rmse:22146.97539	validation_1-rmse:33488.32512
[21]	validation_0-rmse:21451.91814	validation_1-rmse:32986.14746
[22]	validation_0-rmse:20844.12469	validation_1-rmse:32514.35349
[23]	validation_0-rmse:20278.17371	validation_1-rmse:32265.58492
[24]	validation_0-rmse:19769.23424	validation_1-rmse:32065.88325
[25]	validation_0-rmse:19330.95042	validation_1-rmse:31783.42352
[26]	validation_0-rmse:18938.96910	validation_1-rmse:31452.40945
[27]	validation_0-rmse:18469.84238	validation_1-rmse:31193.25027
[28]	validation_0-rmse:18056.81178	validation_1-rmse:30963.57389
[29]	validation_0-rmse:17699.04077	validation_1-rmse:30800.24726
[30]	validation_0-rmse:17380.23489	validation_1-rmse:30656.27114
[31]	validation_0-rmse:17042.93197	validation_1-rmse:30421.24088
[32]	validation_0-rmse:16728.89263	validation_1-rmse:30250.88544
[33]	validation_0-rmse:16402.86279	validation_1-rmse:30290.47945
[34]	validation_0-rmse:16063.11172	validation_1-rmse:30207.54612
[35]	validation_0-rmse:15800.82453	validation_1-rmse:30202.17239
[36]	validation_0-rmse:15593.26356	validation_1-rmse:30148.52294
[37]	validation_0-rmse:15362.04726	validation_1-rmse:30144.52077
[38]	validation_0-rmse:15209.44196	validation_1-rmse:30077.06972
[39]	validation_0-rmse:15071.16011	validation_1-rmse:30031.71573
[40]	validation_0-rmse:14855.41890	validation_1-rmse:30008.15365
[41]	validation_0-rmse:14729.30351	validation_1-rmse:29971.12883
[42]	validation_0-rmse:14619.09227	validation_1-rmse:30000.85843
[43]	validation_0-rmse:14516.68055	validation_1-rmse:30017.30245
[44]	validation_0-rmse:14434.56843	validation_1-rmse:29985.43943
[45]	validation_0-rmse:14267.31069	validation_1-rmse:29987.22929
[46]	validation_0-rmse:14186.62625	validation_1-rmse:30063.01536
[47]	validation_0-rmse:14114.96128	validation_1-rmse:30071.45333
[48]	validation_0-rmse:14045.32317	validation_1-rmse:30136.62240
[49]	validation_0-rmse:13878.71972	validation_1-rmse:30132.72088
[50]	validation_0-rmse:13756.14819	validation_1-rmse:30133.40163
[51]	validation_0-rmse:13670.45072	validation_1-rmse:30137.18406
[52]	validation_0-rmse:13589.24230	validation_1-rmse:30189.83340
[53]	validation_0-rmse:13534.02417	validation_1-rmse:30159.16439
[54]	validation_0-rmse:13402.63195	validation_1-rmse:30209.28167
[55]	validation_0-rmse:13325.79715	validation_1-rmse:30179.88467
[56]	validation_0-rmse:13252.86668	validation_1-rmse:30157.87934
[57]	validation_0-rmse:13189.35689	validation_1-rmse:30120.98764
[58]	validation_0-rmse:13122.00641	validation_1-rmse:30111.69700
[59]	validation_0-rmse:13073.39773	validation_1-rmse:30121.59937
[60]	validation_0-rmse:13016.42063	validation_1-rmse:30109.56940
[61]	validation_0-rmse:12905.79256	validation_1-rmse:30123.86148
[62]	validation_0-rmse:12817.74617	validation_1-rmse:30118.21006
[63]	validation_0-rmse:12778.45647	validation_1-rmse:30104.93015
[64]	validation_0-rmse:12693.12025	validation_1-rmse:30125.27322
[65]	validation_0-rmse:12521.20022	validation_1-rmse:30167.86210
[66]	validation_0-rmse:12439.29640	validation_1-rmse:30171.20592
[67]	validation_0-rmse:12375.86631	validation_1-rmse:30161.93348
[68]	validation_0-rmse:12309.27041	validation_1-rmse:30164.23156
[69]	validation_0-rmse:12269.31452	validation_1-rmse:30149.12402
[70]	validation_0-rmse:12217.43429	validation_1-rmse:30152.37377
[71]	validation_0-rmse:12163.80516	validation_1-rmse:30166.35987
Mean Squared Error: 898268566.4025571
Mean Absolute Error: 19523.425660851884
R² Score: 0.8828904032707214
RMSE Score: 29971.12888101743
No description has been provided for this image
In [277]:
X_test = test.drop(columns=['Id'])  

predictions = model.predict(X_test)  

output = pd.DataFrame({'Id': test['Id'], 'SalePrice': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")
Your submission was successfully saved!

🔹 Experiment¶

In [292]:
y = train["SalePrice"]

X = pd.get_dummies(train.drop(columns=["SalePrice"]))
X_test = pd.get_dummies(test)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=SEED)

model = xg.XGBRegressor(
    n_estimators=200, 
    learning_rate=0.1, 
    max_depth=6,
    early_stopping_rounds=20,
    alpha=0.1,
    lambda_=0.1,
    gamma=0.1,
    random_state=SEED)
    
model.fit(X, y, 
            eval_set=[(X_train, y_train), (X_val, y_val)])


predictions = model.predict(X_test)
predictions_val = model.predict(X_val)

mse = mean_squared_error(y_val, predictions_val)
mae = mean_absolute_error(y_val, predictions_val)
r2 = r2_score(y_val, predictions_val)
rms = root_mean_squared_error(y_val, predictions_val)

print(f"Mean Squared Error: {mse}")
print(f"Mean Absolute Error: {mae}")
print(f"R² Score: {r2}")
print(f"RMSE Score: {rms}")

output = pd.DataFrame({'Id': test['Id'], 'SalePrice': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")
[0]	validation_0-rmse:70587.64987	validation_1-rmse:80031.70279
[1]	validation_0-rmse:64649.68176	validation_1-rmse:73184.89989
[1]	validation_0-rmse:64649.68176	validation_1-rmse:73184.89989
[2]	validation_0-rmse:59242.00549	validation_1-rmse:67185.38236
[3]	validation_0-rmse:54359.50865	validation_1-rmse:61895.44966
[4]	validation_0-rmse:49975.55922	validation_1-rmse:56932.04077
[5]	validation_0-rmse:46056.33986	validation_1-rmse:52665.26736
[6]	validation_0-rmse:42466.32732	validation_1-rmse:48643.13051
[7]	validation_0-rmse:39249.99216	validation_1-rmse:45025.81572
[8]	validation_0-rmse:36330.76481	validation_1-rmse:41553.31432
[9]	validation_0-rmse:33654.57960	validation_1-rmse:38626.54093
[10]	validation_0-rmse:31313.60302	validation_1-rmse:35821.70997
[11]	validation_0-rmse:29110.11036	validation_1-rmse:33341.97377
[12]	validation_0-rmse:27103.52663	validation_1-rmse:31081.01471
[13]	validation_0-rmse:25287.70466	validation_1-rmse:29084.07657
[14]	validation_0-rmse:23659.23322	validation_1-rmse:27250.29113
[15]	validation_0-rmse:22147.78931	validation_1-rmse:25627.62607
[16]	validation_0-rmse:20838.57970	validation_1-rmse:24102.99901
[17]	validation_0-rmse:19650.22950	validation_1-rmse:22670.59125
[18]	validation_0-rmse:18587.57553	validation_1-rmse:21453.46037
[19]	validation_0-rmse:17613.47630	validation_1-rmse:20314.02078
[20]	validation_0-rmse:16757.09225	validation_1-rmse:19266.49970
[21]	validation_0-rmse:15981.54455	validation_1-rmse:18293.96320
[22]	validation_0-rmse:15285.74510	validation_1-rmse:17462.45386
[23]	validation_0-rmse:14674.46288	validation_1-rmse:16632.30008
[24]	validation_0-rmse:14073.28338	validation_1-rmse:15877.35422
[25]	validation_0-rmse:13541.73838	validation_1-rmse:15201.78559
[26]	validation_0-rmse:13063.53557	validation_1-rmse:14571.93830
[27]	validation_0-rmse:12618.95625	validation_1-rmse:13982.81317
[28]	validation_0-rmse:12213.15044	validation_1-rmse:13416.22001
[29]	validation_0-rmse:11831.96912	validation_1-rmse:12885.53252
[30]	validation_0-rmse:11485.12523	validation_1-rmse:12391.00656
[31]	validation_0-rmse:11175.60498	validation_1-rmse:11957.29005
[32]	validation_0-rmse:10851.96172	validation_1-rmse:11575.84600
[33]	validation_0-rmse:10566.09478	validation_1-rmse:11211.75370
[34]	validation_0-rmse:10322.16180	validation_1-rmse:10813.40977
[35]	validation_0-rmse:10124.33637	validation_1-rmse:10501.44477
[36]	validation_0-rmse:9889.28919	validation_1-rmse:10142.20931
[37]	validation_0-rmse:9677.60613	validation_1-rmse:9835.03811
[38]	validation_0-rmse:9476.21101	validation_1-rmse:9602.87599
[39]	validation_0-rmse:9300.46041	validation_1-rmse:9388.69295
[40]	validation_0-rmse:9110.44486	validation_1-rmse:9204.20090
[41]	validation_0-rmse:8952.29241	validation_1-rmse:9034.83823
[42]	validation_0-rmse:8765.06154	validation_1-rmse:8835.48348
[43]	validation_0-rmse:8625.50611	validation_1-rmse:8670.94196
[44]	validation_0-rmse:8493.23515	validation_1-rmse:8500.11982
[45]	validation_0-rmse:8350.93387	validation_1-rmse:8374.53401
[46]	validation_0-rmse:8225.78918	validation_1-rmse:8167.25119
[47]	validation_0-rmse:8099.35086	validation_1-rmse:8071.30511
[48]	validation_0-rmse:7984.65752	validation_1-rmse:7922.81929
[49]	validation_0-rmse:7886.92947	validation_1-rmse:7850.71196
[50]	validation_0-rmse:7794.60942	validation_1-rmse:7755.77065
[51]	validation_0-rmse:7708.24187	validation_1-rmse:7658.16723
[52]	validation_0-rmse:7620.49625	validation_1-rmse:7551.95646
[53]	validation_0-rmse:7569.74714	validation_1-rmse:7481.04963
[54]	validation_0-rmse:7465.39791	validation_1-rmse:7379.62983
[55]	validation_0-rmse:7371.19594	validation_1-rmse:7273.97990
[56]	validation_0-rmse:7278.57362	validation_1-rmse:7198.43798
[57]	validation_0-rmse:7173.35822	validation_1-rmse:7114.04825
[58]	validation_0-rmse:7136.87679	validation_1-rmse:7069.84567
[59]	validation_0-rmse:7058.10708	validation_1-rmse:6987.45067
[60]	validation_0-rmse:6988.97546	validation_1-rmse:6912.30361
[61]	validation_0-rmse:6892.58579	validation_1-rmse:6835.89551
[62]	validation_0-rmse:6819.68506	validation_1-rmse:6757.16717
[63]	validation_0-rmse:6748.45368	validation_1-rmse:6685.19423
[64]	validation_0-rmse:6681.94260	validation_1-rmse:6642.21978
[65]	validation_0-rmse:6601.94817	validation_1-rmse:6572.81010
[66]	validation_0-rmse:6505.71789	validation_1-rmse:6478.77918
[67]	validation_0-rmse:6449.04379	validation_1-rmse:6407.14816
[68]	validation_0-rmse:6404.63080	validation_1-rmse:6377.95603
[69]	validation_0-rmse:6357.67795	validation_1-rmse:6330.71900
[70]	validation_0-rmse:6299.29617	validation_1-rmse:6280.65951
[71]	validation_0-rmse:6235.32342	validation_1-rmse:6214.64317
[72]	validation_0-rmse:6208.95556	validation_1-rmse:6180.39512
[73]	validation_0-rmse:6157.02330	validation_1-rmse:6128.25658
[74]	validation_0-rmse:6127.79588	validation_1-rmse:6105.01079
[75]	validation_0-rmse:6051.12253	validation_1-rmse:6032.61492
[76]	validation_0-rmse:6010.87126	validation_1-rmse:5993.43207
[77]	validation_0-rmse:5993.50053	validation_1-rmse:5976.64015
[78]	validation_0-rmse:5939.30511	validation_1-rmse:5929.25695
[79]	validation_0-rmse:5912.48947	validation_1-rmse:5902.42666
[80]	validation_0-rmse:5815.24231	validation_1-rmse:5832.58434
[81]	validation_0-rmse:5797.36301	validation_1-rmse:5814.00680
[82]	validation_0-rmse:5749.41861	validation_1-rmse:5793.56690
[83]	validation_0-rmse:5677.89215	validation_1-rmse:5743.61818
[84]	validation_0-rmse:5633.00051	validation_1-rmse:5699.19558
[85]	validation_0-rmse:5594.40849	validation_1-rmse:5653.17000
[86]	validation_0-rmse:5580.69699	validation_1-rmse:5640.91717
[87]	validation_0-rmse:5566.39635	validation_1-rmse:5613.70050
[88]	validation_0-rmse:5520.50828	validation_1-rmse:5560.43383
[89]	validation_0-rmse:5489.74263	validation_1-rmse:5529.57610
[90]	validation_0-rmse:5449.62668	validation_1-rmse:5510.24936
[91]	validation_0-rmse:5418.45678	validation_1-rmse:5473.16424
[92]	validation_0-rmse:5388.37255	validation_1-rmse:5440.24021
[93]	validation_0-rmse:5362.28881	validation_1-rmse:5425.56233
[94]	validation_0-rmse:5322.91534	validation_1-rmse:5389.84024
[95]	validation_0-rmse:5294.96160	validation_1-rmse:5368.44202
[96]	validation_0-rmse:5282.40069	validation_1-rmse:5354.68423
[97]	validation_0-rmse:5242.28312	validation_1-rmse:5322.16064
[98]	validation_0-rmse:5202.26545	validation_1-rmse:5287.06498
[99]	validation_0-rmse:5174.85089	validation_1-rmse:5260.35961
[100]	validation_0-rmse:5144.72702	validation_1-rmse:5232.19433
[101]	validation_0-rmse:5101.61189	validation_1-rmse:5185.68563
[102]	validation_0-rmse:5060.41015	validation_1-rmse:5154.68045
[103]	validation_0-rmse:5025.89247	validation_1-rmse:5117.58963
[104]	validation_0-rmse:5009.78968	validation_1-rmse:5105.60818
[105]	validation_0-rmse:4988.40165	validation_1-rmse:5087.51981
[106]	validation_0-rmse:4974.69629	validation_1-rmse:5068.66525
[107]	validation_0-rmse:4952.76548	validation_1-rmse:5052.18994
[108]	validation_0-rmse:4887.03583	validation_1-rmse:4976.90300
[109]	validation_0-rmse:4874.04496	validation_1-rmse:4946.23736
[110]	validation_0-rmse:4849.61283	validation_1-rmse:4917.60782
[111]	validation_0-rmse:4835.78535	validation_1-rmse:4899.17147
[112]	validation_0-rmse:4789.57721	validation_1-rmse:4854.19617
[113]	validation_0-rmse:4766.66454	validation_1-rmse:4828.94202
[114]	validation_0-rmse:4727.60927	validation_1-rmse:4803.42931
[115]	validation_0-rmse:4676.17932	validation_1-rmse:4745.15357
[116]	validation_0-rmse:4650.10997	validation_1-rmse:4718.29714
[117]	validation_0-rmse:4619.37273	validation_1-rmse:4690.31852
[118]	validation_0-rmse:4573.83564	validation_1-rmse:4649.45849
[119]	validation_0-rmse:4547.31865	validation_1-rmse:4619.54640
[120]	validation_0-rmse:4491.15697	validation_1-rmse:4558.28503
[121]	validation_0-rmse:4441.91897	validation_1-rmse:4508.46869
[122]	validation_0-rmse:4408.91937	validation_1-rmse:4479.93339
[123]	validation_0-rmse:4361.35768	validation_1-rmse:4429.59941
[124]	validation_0-rmse:4314.05256	validation_1-rmse:4381.95376
[125]	validation_0-rmse:4258.36031	validation_1-rmse:4327.34349
[126]	validation_0-rmse:4244.66227	validation_1-rmse:4302.72297
[127]	validation_0-rmse:4225.99476	validation_1-rmse:4285.42942
[128]	validation_0-rmse:4217.47108	validation_1-rmse:4277.20420
[129]	validation_0-rmse:4189.42821	validation_1-rmse:4246.81239
[130]	validation_0-rmse:4164.32663	validation_1-rmse:4216.79614
[131]	validation_0-rmse:4144.25398	validation_1-rmse:4190.97523
[132]	validation_0-rmse:4112.07313	validation_1-rmse:4148.61536
[133]	validation_0-rmse:4102.16972	validation_1-rmse:4128.64277
[134]	validation_0-rmse:4077.25396	validation_1-rmse:4104.30039
[135]	validation_0-rmse:4056.84414	validation_1-rmse:4085.67558
[136]	validation_0-rmse:4044.31283	validation_1-rmse:4074.29904
[137]	validation_0-rmse:4014.01197	validation_1-rmse:4049.64520
[138]	validation_0-rmse:3992.64410	validation_1-rmse:4028.27538
[139]	validation_0-rmse:3964.81126	validation_1-rmse:3990.36035
[140]	validation_0-rmse:3913.75930	validation_1-rmse:3953.12549
[141]	validation_0-rmse:3890.90020	validation_1-rmse:3941.50577
[142]	validation_0-rmse:3846.02751	validation_1-rmse:3885.46475
[143]	validation_0-rmse:3842.45550	validation_1-rmse:3879.50182
[144]	validation_0-rmse:3823.98294	validation_1-rmse:3871.50742
[145]	validation_0-rmse:3767.71679	validation_1-rmse:3808.29943
[146]	validation_0-rmse:3737.63180	validation_1-rmse:3787.87498
[147]	validation_0-rmse:3719.42349	validation_1-rmse:3767.07622
[148]	validation_0-rmse:3708.98350	validation_1-rmse:3751.79678
[149]	validation_0-rmse:3695.12640	validation_1-rmse:3746.80037
[150]	validation_0-rmse:3679.49772	validation_1-rmse:3730.16574
[151]	validation_0-rmse:3630.48808	validation_1-rmse:3685.79404
[152]	validation_0-rmse:3609.38037	validation_1-rmse:3665.73230
[153]	validation_0-rmse:3576.39377	validation_1-rmse:3626.26843
[154]	validation_0-rmse:3552.64785	validation_1-rmse:3608.97743
[155]	validation_0-rmse:3531.24669	validation_1-rmse:3573.57183
[156]	validation_0-rmse:3503.82710	validation_1-rmse:3554.14229
[157]	validation_0-rmse:3443.69260	validation_1-rmse:3517.59897
[158]	validation_0-rmse:3436.36577	validation_1-rmse:3511.07014
[159]	validation_0-rmse:3415.63803	validation_1-rmse:3496.55568
[160]	validation_0-rmse:3381.27854	validation_1-rmse:3466.26842
[161]	validation_0-rmse:3346.78595	validation_1-rmse:3419.65418
[162]	validation_0-rmse:3319.95810	validation_1-rmse:3386.55903
[163]	validation_0-rmse:3294.38413	validation_1-rmse:3361.04764
[164]	validation_0-rmse:3267.84582	validation_1-rmse:3338.48655
[165]	validation_0-rmse:3245.09032	validation_1-rmse:3324.24743
[166]	validation_0-rmse:3223.23321	validation_1-rmse:3312.40744
[167]	validation_0-rmse:3218.06584	validation_1-rmse:3309.82755
[168]	validation_0-rmse:3203.86517	validation_1-rmse:3300.54403
[169]	validation_0-rmse:3165.38852	validation_1-rmse:3275.33392
[170]	validation_0-rmse:3145.66778	validation_1-rmse:3264.60394
[171]	validation_0-rmse:3124.46120	validation_1-rmse:3242.11976
[172]	validation_0-rmse:3115.78265	validation_1-rmse:3236.92548
[173]	validation_0-rmse:3097.40372	validation_1-rmse:3226.08651
[174]	validation_0-rmse:3074.52883	validation_1-rmse:3205.52744
[175]	validation_0-rmse:3066.97928	validation_1-rmse:3202.01259
[176]	validation_0-rmse:3029.02029	validation_1-rmse:3162.68539
[177]	validation_0-rmse:3018.13234	validation_1-rmse:3159.12957
[178]	validation_0-rmse:2976.18369	validation_1-rmse:3126.48747
[179]	validation_0-rmse:2947.34827	validation_1-rmse:3088.67305
[180]	validation_0-rmse:2913.13894	validation_1-rmse:3052.62399
[181]	validation_0-rmse:2883.91406	validation_1-rmse:3023.61200
[182]	validation_0-rmse:2854.28859	validation_1-rmse:2988.84954
[183]	validation_0-rmse:2835.13490	validation_1-rmse:2972.10465
[184]	validation_0-rmse:2824.07141	validation_1-rmse:2957.72309
[185]	validation_0-rmse:2809.90190	validation_1-rmse:2946.91391
[186]	validation_0-rmse:2784.31360	validation_1-rmse:2923.10220
[187]	validation_0-rmse:2753.50491	validation_1-rmse:2884.24948
[188]	validation_0-rmse:2726.44551	validation_1-rmse:2853.00564
[189]	validation_0-rmse:2719.54164	validation_1-rmse:2848.69755
[190]	validation_0-rmse:2708.18044	validation_1-rmse:2831.37940
[191]	validation_0-rmse:2681.38562	validation_1-rmse:2789.62957
[192]	validation_0-rmse:2642.87954	validation_1-rmse:2756.96800
[193]	validation_0-rmse:2602.38742	validation_1-rmse:2722.92170
[194]	validation_0-rmse:2591.99517	validation_1-rmse:2711.62501
[195]	validation_0-rmse:2567.58094	validation_1-rmse:2692.85217
[196]	validation_0-rmse:2555.80563	validation_1-rmse:2674.29760
[197]	validation_0-rmse:2529.68248	validation_1-rmse:2645.25917
[198]	validation_0-rmse:2507.05234	validation_1-rmse:2624.34462
[199]	validation_0-rmse:2467.73762	validation_1-rmse:2593.10381
Mean Squared Error: 6724187.362976126
Mean Absolute Error: 1856.9438811001712
R² Score: 0.9991233348846436
RMSE Score: 2593.1038087543134
Your submission was successfully saved!